Abstract

In this paper we introduce a novel hash learning framework that has two main distinguishing features, 
when compared to past approaches. First, it utilizes codewords in the Hamming space as ancillary means 
to accomplish its hash learning task. These codewords, which are inferred from the data, attempt to 
capture similarity aspects of the data’s hash codes. Secondly and more importantly, the same framework 
is capable of addressing supervised, unsupervised and, even, semi-supervised hash learning tasks in a 
natural manner. A series of comparative experiments focused on content-based image retrieval highlights 
its performance advantages.¹
1 Introduction

With the explosive growth of web data including documents, images and videos, content-based image retrieval 
(CBIR) has attracted plenty of attention over the past years [3]. Given a query sample, a typical CBIR 
scheme retrieves samples stored in a database that are most similar to the query sample. The similarity is 
gauged in terms of a pre-specified distance metric and the retrieved samples are the nearest neighbors of the 
query point w.r.t. this metric. However, exhaustively comparing the query sample with every other sample 
in the database may be computationally expensive in many current practical settings. Additionally, most 
CBIR approaches may be hindered by the sheer size of each sample; for example, visual descriptors of an 
image or a video may number in the thousands. Furthermore, storage of these high-dimensional data also 
presents a challenge. 

Considerable effort has been invested in designing hash functions transforming the original data into 
compact binary codes to reap the benefits of a potentially fast similarity search; note that hash functions 
are typically designed to preserve certain similarity qualities between the data. For example, approximate 
nearest neighbors (ANN) search [22] using compact binary codes in Hamming space was shown to achieve 
sub-linear search time. Storage of the binary codes is, obviously, also much more efficient.
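To make the speed and storage argument concrete, the following minimal sketch packs a {-1, +1} code into an integer and computes the Hamming distance with an XOR followed by a bit count. The helper names `pack_code` and `hamming_packed` and the toy database are hypothetical and only serve as an illustration.

```python
def pack_code(bits):
    """Pack a sequence of {-1, +1} entries into a non-negative integer (one bit per entry)."""
    value = 0
    for bit in bits:
        if bit not in (-1, 1):
            raise ValueError("hash code entries must be -1 or +1")
        value = (value << 1) | (1 if bit == 1 else 0)
    return value


def hamming_packed(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


# Hypothetical 8-bit codes for a query and three database items.
query = pack_code([1, -1, -1, 1, 1, -1, 1, 1])
database = {
    "item_a": pack_code([1, -1, -1, 1, 1, -1, 1, -1]),
    "item_b": pack_code([-1, 1, 1, -1, 1, -1, 1, 1]),
    "item_c": pack_code([1, -1, -1, 1, 1, -1, 1, 1]),
}

# Rank the database items by increasing Hamming distance to the query.
ranking = sorted(database, key=lambda name: hamming_packed(query, database[name]))
```

Each comparison touches a single machine word per 64 bits of code, which is why searching over compact binary codes can be much cheaper than comparing high-dimensional descriptors.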

Existing hashing methods can be divided into two categories: data-independent and data-dependent. The 
former category does not use a data-driven approach to choose the hash function. For example, Locality
Sensitive Hashing (LSH) [4] randomly projects and thresholds data into the Hamming space for generating 
binary codes, where closely located (in terms of Euclidean distances in the data’s native space) samples are 
likely to have similar binary codes. Furthermore, in [9], the authors proposed a method for ANN search 
using a learned Mahalanobis metric combined with LSH. 
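As an illustration of the data-independent idea, the sketch below implements one common random-projection variant: every bit is the sign of the projection onto a random Gaussian direction. This is an assumption made for illustration and not necessarily the exact scheme of [4]; the names `make_lsh` and `lsh_code` are hypothetical.

```python
import random


def make_lsh(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian directions; each one yields a single bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_code(x, directions):
    """Project x on every direction and threshold at zero to get a {-1, +1} code."""
    code = []
    for w in directions:
        projection = sum(wi * xi for wi, xi in zip(w, x))
        code.append(1 if projection >= 0.0 else -1)
    return code


def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


directions = make_lsh(dim=4, num_bits=16, seed=0)
x = [1.0, 0.5, -0.2, 0.3]
x_scaled = [2.0 * v for v in x]      # same direction as x
x_opposite = [-v for v in x]         # antipodal to x

code_x = lsh_code(x, directions)
same_direction_distance = hamming(code_x, lsh_code(x_scaled, directions))
opposite_distance = hamming(code_x, lsh_code(x_opposite, directions))
```

No training data is involved in choosing the directions, which is what distinguishes this family from the data-dependent methods discussed next.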

On the other hand, data-dependent methods can, in turn, be grouped into supervised, unsupervised
and semi-supervised learning paradigms. The bulk of work in data-dependent hashing methods has so far
followed the supervised learning paradigm. Recent work includes Semantic Hashing
[18], which designs the hash function using a Restricted Boltzmann Machine (RBM). Binary Reconstructive
Embedding (BRE) [10] tries to minimize a cost function measuring the difference between the original
metric distances and the reconstructed distances in the Hamming space. Minimal Loss Hashing (MLH) [17]
learns the hash function from pair-wise side information and the problem is formulated based on a bound
inspired by the theory of structural Support Vector Machines [27]. The work in [16] addresses a scenario where a
small portion of sample pairs are manually labeled as similar or dissimilar and proposes the Label-regularized
Max-margin Partition algorithm. Moreover, Self-Taught Hashing [28] first identifies binary codes for given
documents via unsupervised learning; next, classifiers are trained to predict codes for query documents.
Additionally, Fisher Linear Discriminant Analysis (LDA) is employed in [21] to embed the original data into
a lower dimensional space and hash codes are obtained subsequently via thresholding. Also, Boosting-based
Hashing is used in [20] and [1], in which a set of weak hash functions is learned according to the boosting
framework. In [11], the hash functions are learned from triplets of side information; their method is designed
to preserve the relative relationship reflected by the triplets and is optimized using column generation.
Finally, Kernel Supervised Hashing (KSH) [13] introduces a kernel-based hashing method, which seems to
exhibit remarkable experimental results.

¹This work has been accepted by ECML/PKDD 2015. Please cite the ECML version of this paper.

As for unsupervised learning, several approaches have been proposed: Spectral Hashing (SPH) [26] designs 
the hash function by using spectral graph analysis with the assumption of a uniform data distribution. 
Anchor Graph Hashing (AGH) [14] uses a small-size anchor graph to approximate low-rank
adjacency matrices, which leads to computational savings. Also, in [5], the authors introduce Iterative
Quantization, which tries to learn an orthogonal rotation matrix so that the quantization error of mapping 
the data to the vertices of the binary hypercube is minimized. 

To the best of our knowledge, the only approach to date following a semi-supervised learning paradigm 
is Semi-Supervised Hashing (SSH) [25], [24]. The SSH framework minimizes an empirical error using labeled
data, but to avoid over-fitting, its model also includes an information theoretic regularizer that utilizes both 
labeled and unlabeled data. 

In this paper we propose *Supervised Hash Learning (*SHL) (* stands for all three learning paradigms), 
a novel hash function learning approach, which sets itself apart from past approaches in two major ways. 
First, it uses a set of Hamming space codewords that are learned during training in order to capture the 
intrinsic similarities between the data’s hash codes, so that same-class data are grouped together. Unlabeled 
data also contribute to the adjustment of codewords leveraging from the inter-sample dissimilarities of their 
generated hash codes as measured by the Hamming metric. Due to these codeword-specific characteristics, 
a major advantage offered by *SHL is that it can naturally engage supervised, unsupervised and, even, 
semi-supervised hash learning tasks using a single formulation. Obviously, the latter ability readily allows 
*SHL to perform transductive hash learning. 

In Section 2, we provide *SHL’s formulation, which is mainly motivated by an attempt to minimize the
within-group Hamming distances in the code space between a group’s codeword and the hash codes of data.
With regard to the hash functions, *SHL adopts a kernel-based approach. The aforementioned formulation
eventually leads to a minimization problem over the codewords as well as over the Reproducing Kernel Hilbert
Space (RKHS) vectors defining the hash functions. A noteworthy aspect of the resulting problem is
that the minimization over the latter parameters leads to a set of Support Vector Machine (SVM) problems,
according to which each SVM generates a single bit of a sample’s hash code. In lieu of choosing a fixed,
arbitrary kernel function, we use a simple Multiple Kernel Learning (MKL) approach (e.g., see [8]) to infer a
good kernel from the data. We need to note here that Self-Taught Hashing (STH) [28] also employs SVMs to
generate hash codes. However, STH differs significantly from *SHL; its unsupervised and supervised learning
stages are completely decoupled, while *SHL uses a single cost function that simultaneously accommodates
both of these learning paradigms. Unlike STH, SVMs arise naturally from the problem formulation in *SHL.

Next, in Section 3, an efficient Majorization-Minimization (MM) algorithm is showcased that can be used
to optimize *SHL’s framework via a Block Coordinate Descent (BCD) approach. The first block optimization
amounts to training a set of SVMs, which can be efficiently accomplished by using, for example, LIBSVM
[2]. The second block optimization step addresses the MKL parameters, while the third one adjusts the
codewords. Both of these steps are computationally fast due to the existence of closed-form solutions.

Finally, in Section 5 we demonstrate the capabilities of *SHL on a series of comparative experiments.
The section emphasizes supervised hash learning problems in the context of CBIR, since the majority
of hash learning approaches address this paradigm. We also include some preliminary transductive hash
learning results for *SHL as a proof of concept. Remarkably, when compared to other hashing methods on
supervised hash learning tasks, *SHL exhibits the best retrieval accuracy for all the datasets we considered.
Some clues to *SHL’s superior performance are provided in Section 4.


2 Formulation 

In what follows, $[\cdot]$ denotes the Iverson bracket, i.e., $[\text{predicate}] = 1$, if the predicate is true, and
$[\text{predicate}] = 0$, if otherwise. Additionally, vectors and matrices are denoted in boldface. All vectors are
considered column vectors and $\cdot^T$ denotes transposition. Also, for any positive integer $K$, we define
$\mathbb{N}_K \triangleq \{1, \ldots, K\}$.

Central to hash function learning is the design of functions transforming data to compact binary codes
in a Hamming space to fulfill a given machine learning task. Consider the Hamming space $\mathbb{H}^B \triangleq \{-1, 1\}^B$,
which implies $B$-bit hash codes. *SHL addresses multi-class classification tasks with an arbitrary set $\mathcal{X}$
as sample space. It does so by learning a hash function $\mathbf{h} : \mathcal{X} \to \mathbb{H}^B$ and a set of $G$ labeled codewords
$\boldsymbol{\mu}_g$, $g \in \mathbb{N}_G$ (each codeword representing a class), so that the hash code of a labeled sample is mapped close
to the codeword corresponding to the sample’s class label; proximity is measured via the Hamming distance.
Unlabeled samples are also able to contribute to learning both the hash function and the codewords, as
will be demonstrated in the sequel. Finally, a test sample is classified according to the label of the codeword
closest to the sample’s hash code.
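The classification rule described above can be sketched in a few lines. The codewords and the real-valued outputs below are hypothetical values chosen for illustration, and mapping a zero output to +1 is an arbitrary tie-breaking assumption.

```python
def hamming(h, h_prime):
    """Number of positions in which two {-1, +1} codes disagree."""
    return sum(1 for a, b in zip(h, h_prime) if a != b)


def sgn(values):
    """Component-wise signum; zero is mapped to +1 (an assumption of this sketch)."""
    return [1 if v >= 0 else -1 for v in values]


def classify(f_x, codewords):
    """Return the label g whose codeword is closest to sgn(f(x)) in Hamming distance."""
    h = sgn(f_x)
    return min(codewords, key=lambda g: hamming(h, codewords[g]))


# Hypothetical codewords for G = 3 classes and B = 4 bits.
codewords = {
    0: [1, 1, -1, -1],
    1: [-1, 1, 1, -1],
    2: [-1, -1, -1, 1],
}
f_x = [0.7, 0.2, -1.3, 0.4]      # hypothetical outputs f_1(x), ..., f_B(x)
predicted = classify(f_x, codewords)
```

Here the hash code is [1, 1, -1, 1], which lies at Hamming distances 1, 3 and 2 from the three codewords, so the sample is assigned to class 0.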

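In *SHL, the hash code for a sample $x \in \mathcal{X}$ is eventually computed as $\mathbf{h}(x) \triangleq \operatorname{sgn} \mathbf{f}(x) \in \mathbb{H}^B$, where
the signum function is applied component-wise. Furthermore, $\mathbf{f}(x) \triangleq [f_1(x) \ldots f_B(x)]^T$, where
$f_b(x) \triangleq \langle w_b, \phi_b(x) \rangle_{\mathcal{H}_b} + \beta_b$ with $w_b \in \Omega_{w_b} \triangleq \{ w_b \in \mathcal{H}_b : \|w_b\|_{\mathcal{H}_b} \le R_b, R_b > 0 \}$ and $\beta_b \in \mathbb{R}$ for all
$b \in \mathbb{N}_B$. In the previous definition, $\mathcal{H}_b$ is a RKHS with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_b}$, induced norm
$\|w_b\|_{\mathcal{H}_b} \triangleq \sqrt{\langle w_b, w_b \rangle_{\mathcal{H}_b}}$ for all $w_b \in \mathcal{H}_b$, associated feature mapping $\phi_b : \mathcal{X} \to \mathcal{H}_b$ and reproducing
kernel $k_b : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, such that $k_b(x, x') = \langle \phi_b(x), \phi_b(x') \rangle_{\mathcal{H}_b}$ for all $x, x' \in \mathcal{X}$. Instead of a priori
selecting the kernel functions $k_b$, MKL [8] is employed to infer the feature mapping for each bit from the
available data. Specifically, it is assumed that each RKHS $\mathcal{H}_b$ is formed as the direct sum of $M$ common,
pre-specified RKHSs $\mathcal{H}_m$, i.e., $\mathcal{H}_b = \bigoplus_m \sqrt{\theta_{b,m}} \mathcal{H}_m$, where
$\boldsymbol{\theta}_b \triangleq [\theta_{b,1} \ldots \theta_{b,M}]^T \in \Omega_\theta \triangleq \{ \boldsymbol{\theta} \in \mathbb{R}^M : \boldsymbol{\theta} \succeq \mathbf{0}, \|\boldsymbol{\theta}\|_p \le 1, p \ge 1 \}$, $\succeq$ denotes the component-wise $\ge$
relation, $\|\cdot\|_p$ is the usual $l_p$ norm in $\mathbb{R}^M$ and $m$ ranges over $\mathbb{N}_M$. Note that, if each preselected RKHS $\mathcal{H}_m$
has associated kernel function $k_m$, then it holds that $k_b(x, x') = \sum_m \theta_{b,m} k_m(x, x')$ for all $x, x' \in \mathcal{X}$.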
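A minimal sketch of the per-bit combined kernel follows: each bit b owns a non-negative weight vector with p-norm at most 1, and its kernel is the weighted sum of M shared base kernels. The base kernels, the weights and the Gaussian parameterization exp(-||x - y||^2 / (2 sigma^2)) are assumptions made for illustration.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)); this parameterization is an assumption."""
    def k(x, y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))
    return k


def combined_kernel(theta_b, base_kernels):
    """k_b(x, x') = sum_m theta_{b,m} k_m(x, x') for one bit b."""
    def k_b(x, y):
        return sum(t * k(x, y) for t, k in zip(theta_b, base_kernels))
    return k_b


def lp_norm(theta, p):
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)


base_kernels = [linear_kernel, gaussian_kernel(0.5), gaussian_kernel(2.0)]
theta_b = [0.6, 0.0, 0.8]            # hypothetical weights: non-negative, unit 2-norm
k_b = combined_kernel(theta_b, base_kernels)

x, y = [1.0, 0.0], [0.0, 1.0]
value = k_b(x, y)
```

A zero weight, as for the second base kernel here, simply removes that kernel from the bit's feature space.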

Now, assume a training set of size $N$ consisting of labeled and unlabeled samples and let $\mathcal{N}_L$ and $\mathcal{N}_U$
be the index sets for these two subsets respectively. Let also $l_n$ for $n \in \mathcal{N}_L$ be the class label of the $n$-th
labeled sample. By adjusting its parameters, which are collectively denoted as $\boldsymbol{\omega}$, *SHL attempts to reduce
the distortion measure

$$E(\boldsymbol{\omega}) \triangleq \sum_{n \in \mathcal{N}_L} d\left(\mathbf{h}(x_n), \boldsymbol{\mu}_{l_n}\right) + \sum_{n \in \mathcal{N}_U} \min_g d\left(\mathbf{h}(x_n), \boldsymbol{\mu}_g\right) \tag{1}$$

where $d$ is the Hamming distance defined as $d(\mathbf{h}, \mathbf{h}') \triangleq \sum_b [h_b \neq h'_b]$. However, the distortion $E$ is difficult to
directly minimize. As will be illustrated further below, an upper bound $\bar{E}$ of $E$ will be optimized instead.

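In particular, for a hash code produced by *SHL, it holds that $d(\mathbf{h}(x), \boldsymbol{\mu}) = \sum_b [\mu_b f_b(x) < 0]$. If one
defines $\bar{d}(\mathbf{f}, \boldsymbol{\mu}) \triangleq \sum_b [1 - \mu_b f_b]_+$, where $[u]_+ \triangleq \max\{0, u\}$ is the hinge function, then
$d(\operatorname{sgn} \mathbf{f}, \boldsymbol{\mu}) \le \bar{d}(\mathbf{f}, \boldsymbol{\mu})$ holds for every $\mathbf{f} \in \mathbb{R}^B$ and any $\boldsymbol{\mu} \in \mathbb{H}^B$. Based on this latter fact, it holds that

$$E(\boldsymbol{\omega}) \le \bar{E}(\boldsymbol{\omega}) \triangleq \sum_g \sum_n \gamma_{g,n} \bar{d}\left(\mathbf{f}(x_n), \boldsymbol{\mu}_g\right) \tag{2}$$

where

$$\gamma_{g,n} \triangleq \begin{cases} [g = l_n] & n \in \mathcal{N}_L \\ \left[g = \arg\min_{g'} \bar{d}\left(\mathbf{f}(x_n), \boldsymbol{\mu}_{g'}\right)\right] & n \in \mathcal{N}_U \end{cases} \tag{3}$$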


It turns out that $\bar{E}$, which constitutes the model’s loss function, can be efficiently minimized by a three-step
algorithm, which is delineated in the next section.
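The relation between the distortion and its hinge-based upper bound can be checked numerically on a toy example. In the sketch below, labeled samples use the codeword of their class, while an unlabeled sample (marked with `None`, a convention assumed here) uses its closest codeword; all numeric values are hypothetical.

```python
def hamming(h, mu):
    return sum(1 for a, b in zip(h, mu) if a != b)


def sgn(f):
    # Zero is mapped to +1; this tie-breaking rule is an assumption of the sketch.
    return [1 if v >= 0 else -1 for v in f]


def hinge_distance(f, mu):
    """Surrogate distance: sum over bits of [1 - mu_b * f_b]_+ ."""
    return sum(max(0.0, 1.0 - m * v) for v, m in zip(f, mu))


def distortion(outputs, labels, codewords):
    """Distortion E: Hamming distance to the class codeword, or to the closest one if unlabeled."""
    total = 0
    for f, label in zip(outputs, labels):
        h = sgn(f)
        if label is not None:
            total += hamming(h, codewords[label])
        else:
            total += min(hamming(h, mu) for mu in codewords)
    return total


def surrogate(outputs, labels, codewords):
    """Upper bound: gamma selects one codeword per sample, then hinge distances are summed."""
    total = 0.0
    for f, label in zip(outputs, labels):
        if label is not None:
            g = label
        else:
            g = min(range(len(codewords)), key=lambda j: hinge_distance(f, codewords[j]))
        total += hinge_distance(f, codewords[g])
    return total


codewords = [[1, 1, -1], [-1, 1, 1]]          # hypothetical, G = 2 and B = 3
outputs = [[0.9, 1.4, -0.3], [-2.0, 0.1, 0.5], [0.2, -0.6, 1.1]]
labels = [0, 1, None]                         # None marks an unlabeled sample
E = distortion(outputs, labels, codewords)
E_bar = surrogate(outputs, labels, codewords)
```

For these values the distortion is 2, while the surrogate evaluates to 0.8 + 1.4 + 2.8 = 5.0, so the bound holds with room to spare; the surrogate also penalizes correctly signed outputs whose margin is smaller than 1.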






3 Learning Algorithm 

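The next proposition allows us to minimize $\bar{E}$ as defined in Equation (2) via a MM approach [7], [6].

Proposition 1. For any *SHL parameter values $\boldsymbol{\omega}$ and $\boldsymbol{\omega}'$, it holds that

$$\bar{E}(\boldsymbol{\omega}) \le \bar{E}(\boldsymbol{\omega}|\boldsymbol{\omega}') \triangleq \sum_g \sum_n \gamma'_{g,n} \bar{d}\left(\mathbf{f}(x_n), \boldsymbol{\mu}_g\right) \tag{4}$$

where the primed quantities are evaluated on $\boldsymbol{\omega}'$ and

$$\gamma'_{g,n} \triangleq \begin{cases} [g = l_n] & n \in \mathcal{N}_L \\ \left[g = \arg\min_{g'} \bar{d}\left(\mathbf{f}'(x_n), \boldsymbol{\mu}'_{g'}\right)\right] & n \in \mathcal{N}_U \end{cases} \tag{5}$$

Additionally, it holds that $\bar{E}(\boldsymbol{\omega}|\boldsymbol{\omega}) = \bar{E}(\boldsymbol{\omega})$ for any $\boldsymbol{\omega}$. In sum, $\bar{E}(\cdot|\cdot)$ majorizes $\bar{E}(\cdot)$.

Its proof is relatively straightforward and is based on the fact that for any value of $\gamma'_{g,n} \in \{0, 1\}$ other than
$\gamma_{g,n}$ as defined in Equation (3), the value of $\bar{E}(\boldsymbol{\omega}|\boldsymbol{\omega}')$ can never be less than $\bar{E}(\boldsymbol{\omega}|\boldsymbol{\omega}) = \bar{E}(\boldsymbol{\omega})$.

The last proposition gives rise to a MM approach, where $\boldsymbol{\omega}'$ are the current estimates of the model’s
parameter values and $\bar{E}(\boldsymbol{\omega}|\boldsymbol{\omega}')$ is minimized with respect to $\boldsymbol{\omega}$ to yield improved estimates $\boldsymbol{\omega}^*$, such that
$\bar{E}(\boldsymbol{\omega}^*) \le \bar{E}(\boldsymbol{\omega}')$. This minimization can be achieved via a BCD.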
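The majorization property can be probed empirically. In the sketch below, a parameter setting is represented directly by the real-valued outputs and the codewords (a simplifying assumption; the actual parameters are the SVM and MKL quantities), and the majorizer is evaluated with codeword assignments frozen at a second, unrelated setting.

```python
import random


def hinge_distance(f, mu):
    return sum(max(0.0, 1.0 - m * v) for v, m in zip(f, mu))


def assignments(outputs, labels, codewords):
    """Gamma as one index per sample: the class label if available, else the hinge-closest codeword."""
    chosen = []
    for f, label in zip(outputs, labels):
        if label is not None:
            chosen.append(label)
        else:
            chosen.append(min(range(len(codewords)),
                              key=lambda g: hinge_distance(f, codewords[g])))
    return chosen


def majorizer(outputs, codewords, chosen):
    """Hinge distances of the current parameters under the given (possibly stale) assignments."""
    return sum(hinge_distance(f, codewords[g]) for f, g in zip(outputs, chosen))


rng = random.Random(1)
B, G, N = 4, 3, 6
labels = [0, 1, 2, None, None, None]      # three labeled and three unlabeled samples


def random_parameters():
    outputs = [[rng.uniform(-2.0, 2.0) for _ in range(B)] for _ in range(N)]
    codewords = [[rng.choice([-1, 1]) for _ in range(B)] for _ in range(G)]
    return outputs, codewords


gaps = []
for _ in range(200):
    out_new, cw_new = random_parameters()     # plays the role of omega
    out_old, cw_old = random_parameters()     # plays the role of omega'
    gamma_old = assignments(out_old, labels, cw_old)
    gamma_new = assignments(out_new, labels, cw_new)
    upper = majorizer(out_new, cw_new, gamma_old)     # surrogate given stale assignments
    exact = majorizer(out_new, cw_new, gamma_new)     # surrogate with its own assignments
    gaps.append(upper - exact)
```

Every gap is non-negative, because the assignment computed at the current parameters is, by construction, the one that minimizes the hinge distance of each unlabeled sample.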

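Proposition 2. Minimizing $\bar{E}(\cdot|\boldsymbol{\omega}')$ with respect to the Hilbert space vectors, the offsets $\beta_b$ and the MKL
weights $\boldsymbol{\theta}_b$, while regarding the codeword parameters as constant, one obtains the following $B$ independent,
equivalent problems:

$$\inf_{\substack{w_{b,m} \in \mathcal{H}_m,\, m \in \mathbb{N}_M \\ \beta_b \in \mathbb{R},\, \boldsymbol{\theta}_b \in \Omega_\theta}} \; C \sum_g \sum_n \gamma'_{g,n} \left[1 - \mu_{g,b} f_b(x_n)\right]_+ + \frac{1}{2} \sum_m \frac{\|w_{b,m}\|^2_{\mathcal{H}_m}}{\theta_{b,m}}, \qquad b \in \mathbb{N}_B \tag{6}$$

where $f_b(x) = \sum_m \langle w_{b,m}, \phi_m(x) \rangle_{\mathcal{H}_m} + \beta_b$ and $C > 0$ is a regularization constant.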

The proof of this proposition hinges on replacing the (independent) constraints of the Hilbert space
vectors with equivalent regularization terms and, finally, performing the substitution $w_{b,m} \leftarrow \sqrt{\theta_{b,m}}\, w_{b,m}$ as
typically done in such MKL formulations (e.g., see [8]). Note that Problem (6) is jointly convex with respect
to all variables under consideration and, under closer scrutiny, one may recognize it as a binary MKL SVM
training problem, which will become more apparent shortly.

First block minimization: By considering $w_{b,m}$ and $\beta_b$ for each $b$ as a single block, instead of directly
minimizing Problem (6), one can maximize the following problem:

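Proposition 3. The dual form of Problem (6) takes the form of

$$\sup_{\boldsymbol{\alpha}_b \in \Omega_{a_b}} \; \boldsymbol{\alpha}_b^T \mathbf{1}_{NG} - \frac{1}{2} \boldsymbol{\alpha}_b^T \mathbf{D}_b \left[(\mathbf{1}_G \mathbf{1}_G^T) \otimes \mathbf{K}_b\right] \mathbf{D}_b \boldsymbol{\alpha}_b, \qquad b \in \mathbb{N}_B \tag{7}$$

where $\mathbf{1}_K$ stands for the all ones vector of $K$ elements ($K \in \mathbb{N}$), $\boldsymbol{\mu}_b \triangleq [\mu_{1,b} \ldots \mu_{G,b}]^T$, $\mathbf{D}_b \triangleq \operatorname{diag}(\boldsymbol{\mu}_b \otimes \mathbf{1}_N)$,
$\mathbf{K}_b \triangleq \sum_m \theta_{b,m} \mathbf{K}_m$, where $\mathbf{K}_m$ is the data’s $m$-th kernel matrix,
$\Omega_{a_b} \triangleq \{ \boldsymbol{\alpha} \in \mathbb{R}^{NG} : \boldsymbol{\alpha}^T (\boldsymbol{\mu}_b \otimes \mathbf{1}_N) = 0, \mathbf{0} \preceq \boldsymbol{\alpha} \preceq C \boldsymbol{\gamma}' \}$
and $\boldsymbol{\gamma}' \triangleq [\gamma'_{1,1}, \ldots, \gamma'_{1,N}, \gamma'_{2,1}, \ldots, \gamma'_{G,N}]^T$.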

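Proof. After eliminating the hinge function in Problem (6) with the help of slack variables $\xi^b_{g,n}$, we obtain
the following problem for the first block minimization:

$$\begin{aligned} \min_{w_{b,m},\, \beta_b,\, \xi^b_{g,n}} \quad & C \sum_g \sum_n \gamma'_{g,n} \xi^b_{g,n} + \frac{1}{2} \sum_m \frac{\|w_{b,m}\|^2_{\mathcal{H}_m}}{\theta_{b,m}} \\ \text{s.t.} \quad & \xi^b_{g,n} \ge 0 \\ & \xi^b_{g,n} \ge 1 - \Big( \sum_m \langle w_{b,m}, \phi_m(x_n) \rangle_{\mathcal{H}_m} + \beta_b \Big) \mu_{g,b} \end{aligned} \tag{8}$$

Due to the Representer Theorem (e.g., see [19]), we have that

$$w_{b,m} = \theta_{b,m} \sum_n \eta_{b,n} \phi_m(x_n) \tag{9}$$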

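where $n$ is the training sample index. By defining $\boldsymbol{\xi}_b \in \mathbb{R}^{NG}$ to be the vector containing all $\xi^b_{g,n}$'s,
$\boldsymbol{\eta}_b \triangleq [\eta_{b,1}, \eta_{b,2}, \ldots, \eta_{b,N}]^T \in \mathbb{R}^N$ and $\boldsymbol{\mu}_b \triangleq [\mu_{1,b}, \mu_{2,b}, \ldots, \mu_{G,b}]^T \in \mathbb{H}^G$, the vectorized version of Problem (8) in light
of Equation (9) becomes

$$\begin{aligned} \min_{\boldsymbol{\eta}_b,\, \boldsymbol{\xi}_b,\, \beta_b} \quad & C \boldsymbol{\gamma}'^T \boldsymbol{\xi}_b + \frac{1}{2} \boldsymbol{\eta}_b^T \mathbf{K}_b \boldsymbol{\eta}_b \\ \text{s.t.} \quad & \boldsymbol{\xi}_b \succeq \mathbf{0} \\ & \boldsymbol{\xi}_b \succeq \mathbf{1}_{NG} - (\boldsymbol{\mu}_b \otimes \mathbf{K}_b) \boldsymbol{\eta}_b - (\boldsymbol{\mu}_b \otimes \mathbf{1}_N) \beta_b \end{aligned} \tag{10}$$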

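where $\boldsymbol{\gamma}'$ and $\mathbf{K}_b$ are defined in Proposition 3. From the previous problem’s Lagrangian $\mathcal{L}$, one obtains

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\xi}_b} = \mathbf{0} \;\Rightarrow\; \begin{cases} \boldsymbol{\lambda}_b = C \boldsymbol{\gamma}' - \boldsymbol{\alpha}_b \\ \mathbf{0} \preceq \boldsymbol{\alpha}_b \preceq C \boldsymbol{\gamma}' \end{cases} \tag{11}$$

$$\frac{\partial \mathcal{L}}{\partial \beta_b} = 0 \;\Rightarrow\; \boldsymbol{\alpha}_b^T (\boldsymbol{\mu}_b \otimes \mathbf{1}_N) = 0 \tag{12}$$

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\eta}_b} = \mathbf{0} \;\Rightarrow\; \boldsymbol{\eta}_b = \mathbf{K}_b^{-1} (\boldsymbol{\mu}_b \otimes \mathbf{K}_b)^T \boldsymbol{\alpha}_b \tag{13}$$

where $\boldsymbol{\alpha}_b$ and $\boldsymbol{\lambda}_b$ are the dual variables for the two constraints in Problem (10). Utilizing Equation (11),
Equation (12) and Equation (13), the quadratic term of the dual problem becomes

$$\begin{aligned} (\boldsymbol{\mu}_b \otimes \mathbf{K}_b) \mathbf{K}_b^{-1} (\boldsymbol{\mu}_b^T \otimes \mathbf{K}_b) &= (\boldsymbol{\mu}_b \otimes \mathbf{K}_b)(1 \otimes \mathbf{K}_b^{-1})(\boldsymbol{\mu}_b^T \otimes \mathbf{K}_b) \\ &= (\boldsymbol{\mu}_b \otimes \mathbf{I}_{N \times N})(\boldsymbol{\mu}_b^T \otimes \mathbf{K}_b) \\ &= (\boldsymbol{\mu}_b \boldsymbol{\mu}_b^T) \otimes \mathbf{K}_b \end{aligned} \tag{14}$$

Equation (14) can be further manipulated as

$$\begin{aligned} (\boldsymbol{\mu}_b \boldsymbol{\mu}_b^T) \otimes \mathbf{K}_b &= \left[(\operatorname{diag}(\boldsymbol{\mu}_b) \mathbf{1}_G)(\operatorname{diag}(\boldsymbol{\mu}_b) \mathbf{1}_G)^T\right] \otimes \mathbf{K}_b \\ &= \left[\operatorname{diag}(\boldsymbol{\mu}_b)(\mathbf{1}_G \mathbf{1}_G^T) \operatorname{diag}(\boldsymbol{\mu}_b)\right] \otimes \left[\mathbf{I}_N \mathbf{K}_b \mathbf{I}_N\right] \\ &= \left[\operatorname{diag}(\boldsymbol{\mu}_b) \otimes \mathbf{I}_N\right] \left[(\mathbf{1}_G \mathbf{1}_G^T) \otimes \mathbf{K}_b\right] \left[\operatorname{diag}(\boldsymbol{\mu}_b) \otimes \mathbf{I}_N\right] \\ &= \left[\operatorname{diag}(\boldsymbol{\mu}_b \otimes \mathbf{1}_N)\right] \left[(\mathbf{1}_G \mathbf{1}_G^T) \otimes \mathbf{K}_b\right] \left[\operatorname{diag}(\boldsymbol{\mu}_b \otimes \mathbf{1}_N)\right] \\ &= \mathbf{D}_b \left[(\mathbf{1}_G \mathbf{1}_G^T) \otimes \mathbf{K}_b\right] \mathbf{D}_b \end{aligned} \tag{15}$$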
Algorithm 1 Optimization of Problem (6)

Input: Bit length $B$, training samples $X$ containing labeled or unlabeled data.
Output: $\boldsymbol{\omega}$.

1. Initialize $\boldsymbol{\omega}$.
2. While Not Converged
3.     For each bit
4.         $\gamma'_{g,n} \leftarrow$ Equation (5).
5.         Step 1: $w_{b,m} \leftarrow$ Equation (7).
6.                 $\beta_b \leftarrow$ Equation (7).
7.         Step 2: Compute $\|w_{b,m}\|^2_{\mathcal{H}_m}$.
8.                 $\theta_{b,m} \leftarrow$ Equation (16).
9.         Step 3: $\mu_{g,b} \leftarrow$ Equation (17).
10.    End For
11. End While
12. Output $\boldsymbol{\omega}$.


The first equality stems from the identity $\operatorname{diag}(\mathbf{v}) \mathbf{1} = \mathbf{v}$ for any vector $\mathbf{v}$, while the third one stems from
the mixed-product property of the Kronecker product. Also, the identity $\operatorname{diag}(\mathbf{v} \otimes \mathbf{1}) = \operatorname{diag}(\mathbf{v}) \otimes \mathbf{I}$ yields
the fourth equality. Note that $\mathbf{D}_b$ is defined as in Proposition 3. Taking into account Equation (14) and
Equation (15), we reach the dual form stated in Proposition 3. □
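The Kronecker identity at the heart of the proof can be verified numerically. The sketch below uses small integer matrices (hypothetical values) so that both sides can be compared exactly; the helper functions are plain-Python stand-ins for the usual linear algebra routines.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def diag(v):
    return [[v[i] if i == j else 0 for j in range(len(v))] for i in range(len(v))]


mu_b = [1, -1, 1]                 # hypothetical b-th bits of G = 3 codewords
K_b = [[2, 1], [1, 3]]            # hypothetical symmetric kernel matrix for N = 2 samples
G, N = len(mu_b), len(K_b)

outer = [[mi * mj for mj in mu_b] for mi in mu_b]      # mu_b mu_b^T
ones = [[1] * G for _ in range(G)]                     # 1_G 1_G^T
D_b = diag([m for m in mu_b for _ in range(N)])        # diag(mu_b kron 1_N)

lhs = kron(outer, K_b)                                 # (mu_b mu_b^T) kron K_b
rhs = matmul(matmul(D_b, kron(ones, K_b)), D_b)        # D_b [(1_G 1_G^T) kron K_b] D_b
```

Both sides are NG x NG matrices whose (g, n), (g', n') entry equals the product of the two codeword bits and the kernel value, which is why the sign pattern of the codewords can be pulled out into the diagonal matrix.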

Given that $\gamma'_{g,n} \in \{0, 1\}$, one can now easily recognize that Problem (7) is an SVM training problem,
which can be conveniently solved using software packages such as LIBSVM. After solving it, one can
compute the quantities $\langle w_{b,m}, \phi_m(x) \rangle_{\mathcal{H}_m}$, $\beta_b$ and $\|w_{b,m}\|^2_{\mathcal{H}_m}$, which are required in the next step.

Second block minimization: Having optimized over the SVM parameters, one can now optimize the
cost function of Problem (6) with respect to the MKL parameters $\boldsymbol{\theta}_b$ as a single block using the closed-form
solution mentioned in Prop. 2 of [8] for $p > 1$, which is given next.


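$$\theta_{b,m} = \frac{\|w_{b,m}\|_{\mathcal{H}_m}^{\frac{2}{p+1}}}{\left( \sum_{m'} \|w_{b,m'}\|_{\mathcal{H}_{m'}}^{\frac{2p}{p+1}} \right)^{\frac{1}{p}}}, \qquad m \in \mathbb{N}_M,\; b \in \mathbb{N}_B. \tag{16}$$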
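A minimal sketch of this closed-form step follows, assuming the update has the form theta_m = ||w_m||^(2/(p+1)) / (sum over m' of ||w_m'||^(2p/(p+1)))^(1/p) for a single bit. The function name `mkl_update` and the norm values are hypothetical.

```python
def mkl_update(norms, p):
    """Closed-form MKL weights for one bit, given the RKHS norms ||w_{b,m}|| of its M components."""
    numerators = [n ** (2.0 / (p + 1.0)) for n in norms]
    denominator = sum(n ** (2.0 * p / (p + 1.0)) for n in norms) ** (1.0 / p)
    if denominator == 0.0:
        raise ValueError("at least one norm must be positive")
    return [num / denominator for num in numerators]


norms = [3.0, 0.0, 1.0, 0.5]          # hypothetical norms for M = 4 base kernels
p = 2.0
theta = mkl_update(norms, p)
theta_norm = sum(t ** p for t in theta) ** (1.0 / p)
```

The resulting weights are non-negative and have unit p-norm, so they lie on the boundary of the MKL constraint set; kernels whose component has a larger norm receive a larger weight, and a component with zero norm is switched off.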


Third block minimization: Finally, one can now optimize the cost function of Problem (6) with 
respect to the codewords by mere substitution as shown below. 


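$$\inf_{\mu_{g,b} \in \mathbb{H}} \; \sum_n \gamma'_{g,n} \left[1 - \mu_{g,b} f_b(x_n)\right]_+, \qquad g \in \mathbb{N}_G,\; b \in \mathbb{N}_B \tag{17}$$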
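Because each codeword bit can only take two values, the substitution amounts to evaluating the weighted hinge cost for -1 and for +1 and keeping the cheaper option. The sketch below does this for one bit b; the outputs, the assignment matrix and the tie-breaking rule are assumptions made for illustration.

```python
def hinge(u):
    return max(0.0, u)


def update_codeword_bits(f_b, gamma):
    """For bit b, pick mu_{g,b} in {-1, +1} minimizing sum_n gamma[g][n] * [1 - mu * f_b(x_n)]_+ ."""
    bits = []
    for gamma_g in gamma:
        cost = {
            mu: sum(w * hinge(1.0 - mu * f) for w, f in zip(gamma_g, f_b))
            for mu in (-1, 1)
        }
        # Ties are resolved in favor of +1; this choice is an assumption of the sketch.
        bits.append(1 if cost[1] <= cost[-1] else -1)
    return bits


f_b = [1.2, 0.4, -0.9, -1.5]      # hypothetical outputs f_b(x_n) for N = 4 samples
gamma = [
    [1, 1, 0, 0],                 # the first two samples are assigned to codeword 1
    [0, 0, 1, 1],                 # the last two samples are assigned to codeword 2
]
mu_b = update_codeword_bits(f_b, gamma)
```

In this example the first codeword's bit becomes +1 (cost 0.6 versus 3.6) and the second one's becomes -1 (cost 0.1 versus 4.4), i.e., each bit follows the sign that the majority of its assigned samples already produce with margin.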

On balance, as summarized in Algorithm 1, for each bit, the combined MM/BCD algorithm consists of
one SVM optimization step, and two fast steps to optimize the MKL coefficients and codewords respectively.
Once all model parameters $\boldsymbol{\omega}$ have been computed in this fashion, their values become the current estimate
(i.e., $\boldsymbol{\omega}' \leftarrow \boldsymbol{\omega}$), the $\gamma'_{g,n}$'s are accordingly updated and the algorithm continues to iterate until convergence is
established². Based on LIBSVM, which provides $O(N^3)$ complexity [12], our algorithm offers the complexity
$O(BN^3)$ per iteration, where $B$ is the code length and $N$ is the number of instances.

4 Insights to Generalization Performance 

The superior performance of *SHL over other state-of-the-art hash function learning approaches featured
in the next section can be explained to some extent by noticing that *SHL training attempts to minimize
the normalized (by $B$) expected Hamming distance of a labeled sample to the correct codeword, which
is demonstrated next. We constrain ourselves to the case where the training set consists only of labeled
samples (i.e., $\mathcal{N}_L = \mathbb{N}_N$, $\mathcal{N}_U = \emptyset$) and, for reasons of convenience, to a single-kernel learning scenario, where
each code bit is associated to its own feature space $\mathcal{H}_b$ with corresponding kernel function $k_b$. Also, due to
space limitations, we provide the next result without proof.

²A MATLAB® implementation of our framework is available at
https://github.com/yinjiehuang/StarSHL

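Lemma 1. Let $\mathcal{X}$ be an arbitrary set, $\mathcal{F} \triangleq \{\mathbf{f} : x \mapsto \mathbf{f}(x) \in \mathbb{R}^B, x \in \mathcal{X}\}$, $\Psi : \mathbb{R}^B \to \mathbb{R}$ be $L$-Lipschitz
continuous w.r.t. $\|\cdot\|_1$, then

$$\hat{\Re}_N(\Psi \circ \mathcal{F}) \le L \hat{\Re}_N(\|\mathcal{F}\|_1) \tag{18}$$

where $\circ$ stands for function composition, $\hat{\Re}_N(\mathcal{G}) \triangleq \frac{1}{N} E_\sigma \left\{ \sup_{g \in \mathcal{G}} \sum_n \sigma_n g(x_n, l_n) \right\}$ is the empirical Rademacher
complexity of a set $\mathcal{G}$ of functions, $\{x_n, l_n\}$ are i.i.d. samples and $\sigma_n$ are i.i.d. random variables taking values
with $\Pr\{\sigma_n = \pm 1\} = \frac{1}{2}$.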

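To show the main theoretical result of our paper with the help of the previous lemma, we will consider the
sets of functions

$$\mathcal{F} \triangleq \{ \mathbf{f} : x \mapsto [f_1(x), \ldots, f_B(x)]^T, f_b \in \mathcal{F}_b, b \in \mathbb{N}_B \} \tag{19}$$

$$\mathcal{F}_b \triangleq \{ f_b : x \mapsto \langle w_b, \phi_b(x) \rangle_{\mathcal{H}_b} + \beta_b,\; \beta_b \in \mathbb{R} \text{ s.t. } |\beta_b| \le M_b,\; w_b \in \mathcal{H}_b \text{ s.t. } \|w_b\|_{\mathcal{H}_b} \le R_b,\; b \in \mathbb{N}_B \} \tag{20}$$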

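Theorem 1. Assume reproducing kernels of $\{\mathcal{H}_b\}_{b=1}^B$ s.t. $k_b(x, x') \le r^2$, $\forall x, x' \in \mathcal{X}$. Then for a fixed value
of $\rho > 0$, for any $\mathbf{f} \in \mathcal{F}$, any codewords $\boldsymbol{\mu}_l \in \mathbb{H}^B$, $l \in \mathbb{N}_G$, and any $\delta > 0$, with probability $1 - \delta$, it holds that:

$$er(\mathbf{f}, \boldsymbol{\mu}_l) \le \hat{er}(\mathbf{f}, \boldsymbol{\mu}_l) + \frac{2r}{\rho B \sqrt{N}} \sum_b R_b + \sqrt{\frac{\log \frac{1}{\delta}}{2N}} \tag{21}$$

where $er(\mathbf{f}, \boldsymbol{\mu}_l) \triangleq \frac{1}{B} E\{d(\operatorname{sgn} \mathbf{f}(x), \boldsymbol{\mu}_l)\}$, $l \in \mathbb{N}_G$ is the true label of $x \in \mathcal{X}$, $\hat{er}(\mathbf{f}, \boldsymbol{\mu}_l) \triangleq \frac{1}{NB} \sum_{n,b} Q_\rho(f_b(x_n) \mu_{l_n,b})$,
where $Q_\rho(u) \triangleq \min\left\{1, \max\left\{0, 1 - \frac{u}{\rho}\right\}\right\}$.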
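The empirical quantity appearing in the bound can be computed directly. The sketch below assumes the usual ramp form Q_rho(u) = min{1, max{0, 1 - u / rho}} and compares the resulting empirical error with the plain fraction of wrongly signed bits; all data values are hypothetical.

```python
def ramp(u, rho):
    """Q_rho(u) = min{1, max{0, 1 - u / rho}} (assumed form of the margin loss)."""
    return min(1.0, max(0.0, 1.0 - u / rho))


def empirical_error(outputs, labels, codewords, rho):
    """(1 / (N B)) * sum over samples n and bits b of Q_rho(f_b(x_n) * mu_{l_n, b})."""
    total = 0.0
    for f, label in zip(outputs, labels):
        total += sum(ramp(v * m, rho) for v, m in zip(f, codewords[label]))
    return total / (len(outputs) * len(codewords[0]))


def normalized_hamming_error(outputs, labels, codewords):
    """Average fraction of bits whose sign disagrees with the correct codeword."""
    total = 0
    for f, label in zip(outputs, labels):
        total += sum(1 for v, m in zip(f, codewords[label]) if v * m < 0)
    return total / (len(outputs) * len(codewords[0]))


codewords = [[1, -1], [-1, 1]]                 # hypothetical, G = 2 and B = 2
outputs = [[2.0, -0.5], [0.25, 1.5], [-1.0, -0.1]]
labels = [0, 1, 1]
rho = 1.0
ramp_error = empirical_error(outputs, labels, codewords, rho)
sign_error = normalized_hamming_error(outputs, labels, codewords)
```

The ramp-based error can never fall below the sign-based one, since Q_rho equals 1 whenever a bit has the wrong sign and additionally charges correctly signed bits whose margin is below rho.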

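Proof. Notice that

$$\begin{aligned} \frac{1}{B} d\left(\operatorname{sgn} \mathbf{f}(x), \boldsymbol{\mu}_l\right) &= \frac{1}{B} \sum_b [f_b(x) \mu_{l,b} < 0] \le \frac{1}{B} \sum_b Q_\rho\left(f_b(x) \mu_{l,b}\right) \\ \Rightarrow\; E\left\{ \frac{1}{B} d\left(\operatorname{sgn} \mathbf{f}(x), \boldsymbol{\mu}_l\right) \right\} &\le E\left\{ \frac{1}{B} \sum_b Q_\rho\left(f_b(x) \mu_{l,b}\right) \right\} \end{aligned} \tag{22}$$

Consider the set of functions

$$\Psi \triangleq \Big\{ \psi : (x, l) \mapsto \frac{1}{B} \sum_b Q_\rho\left(f_b(x) \mu_{l,b}\right),\; \mathbf{f} \in \mathcal{F},\; \mu_{l,b} \in \{\pm 1\},\; l \in \mathbb{N}_G,\; b \in \mathbb{N}_B \Big\}$$

Then from Theorem 3.1 of [15] and Equation (22), $\forall \psi \in \Psi$, $\forall \delta > 0$, with probability at least $1 - \delta$, we have:

$$er(\mathbf{f}, \boldsymbol{\mu}_l) \le \hat{er}(\mathbf{f}, \boldsymbol{\mu}_l) + 2 \Re_N(\Psi) + \sqrt{\frac{\log \frac{1}{\delta}}{2N}} \tag{23}$$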
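where $\Re_N(\Psi)$ is the Rademacher complexity of $\Psi$. From Lemma 1, the following inequality between empirical
Rademacher complexities is obtained

$$\hat{\Re}_N(\Psi) \le \frac{1}{B \rho} \hat{\Re}_N\left(\|\mathcal{F}_\mu\|_1\right) \tag{24}$$

where $\|\mathcal{F}_\mu\|_1 \triangleq \{ (x, l) \mapsto \|[f_1(x) \mu_{l,1}, \ldots, f_B(x) \mu_{l,B}]^T\|_1,\; \mathbf{f} \in \mathcal{F} \text{ and } \mu_{l,b} \in \{\pm 1\} \}$. The right side of Equation (24)
can be upper-bounded as follows

$$\begin{aligned} \hat{\Re}_N\left(\|\mathcal{F}_\mu\|_1\right) &= \frac{1}{N} E_\sigma \Big\{ \sup_{\mathbf{f} \in \mathcal{F},\, \boldsymbol{\mu}_{l_n} \in \mathbb{H}^B} \sum_n \sigma_n \sum_b |\mu_{l_n,b} f_b(x_n)| \Big\} \\ &= \frac{1}{N} E_\sigma \Big\{ \sup_{\mathbf{f} \in \mathcal{F}} \sum_n \sigma_n \sum_b |f_b(x_n)| \Big\} \\ &= \frac{1}{N} E_\sigma \Big\{ \sup_{w_b \in \mathcal{H}_b,\, \|w_b\|_{\mathcal{H}_b} \le R_b,\, |\beta_b| \le M_b} \sum_n \sigma_n \sum_b \left| \langle w_b, \phi_b(x_n) \rangle_{\mathcal{H}_b} + \beta_b \right| \Big\} \\ &= \frac{1}{N} E_\sigma \Big\{ \sup_{w_b \in \mathcal{H}_b,\, \|w_b\|_{\mathcal{H}_b} \le R_b,\, |\beta_b| \le M_b} \sum_n \sigma_n \sum_b \left| \langle w_b, \operatorname{sgn}(\beta_b) \phi_b(x_n) \rangle_{\mathcal{H}_b} + |\beta_b| \right| \Big\} \\ &= \frac{1}{N} E_\sigma \Big\{ \sup_{|\beta_b| \le M_b} \sum_b \Big[ R_b \sqrt{\boldsymbol{\sigma}^T \mathbf{K}_b \boldsymbol{\sigma}} + |\beta_b| \sum_n \sigma_n \Big] \Big\} \\ &= \frac{1}{N} E_\sigma \Big\{ \sum_b R_b \sqrt{\boldsymbol{\sigma}^T \mathbf{K}_b \boldsymbol{\sigma}} \Big\} \\ &\overset{\text{Jensen's Ineq.}}{\le} \frac{1}{N} \sum_b R_b \sqrt{\operatorname{trace}(\mathbf{K}_b)} \le \frac{r}{\sqrt{N}} \sum_b R_b \end{aligned} \tag{25}$$

From Equation (24) and Equation (25) we obtain $\hat{\Re}_N(\Psi) \le \frac{r}{\rho B \sqrt{N}} \sum_b R_b$. Since $\Re_N(\Psi) \triangleq E_s\{\hat{\Re}_N(\Psi)\}$,
where $E_s$ is the expectation over the samples, we have

$$\Re_N(\Psi) \le \frac{r}{\rho B \sqrt{N}} \sum_b R_b \tag{26}$$

The final result is obtained by combining Equation (23) and Equation (26). □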

It can be observed that minimizing the loss function of Problem (6), in essence, also reduces the bound
of Equation (21). This tends to cluster same-class hash codes around the correct codeword. Since samples 
are classified according to the label of the codeword that is closest to the sample’s hash code, this process 
may lead to good recognition rates, especially when the number of samples N is high, in which case the 
bound becomes tighter. 


5 Experiments 

5.1 Supervised Hash Learning Results

In this section, we compare *SHL to other state-of-the-art hashing algorithms: Kernel Supervised Hashing
(KSH) [13], Binary Reconstructive Embedding (BRE) [10], single-layer Anchor Graph Hashing (1-AGH) and 
its two-layer version (2-AGH) [14], Spectral Hashing (SPH) [26] and Locality-Sensitive Hashing (LSH) [4]. 

Five datasets were considered: Pendigits and USPS from the UCI Repository, as well as Mnist, PASCAL07
and CIFAR-10. For Pendigits (10,992 samples, 256 features, 10 classes), we randomly chose 3,000
samples for training and the rest for testing; for USPS (9,298 samples, 256 features, 10 classes), 3,000 were
used for training and the remaining for testing; for Mnist (70,000 samples, 784 features, 10 classes), 10,000
for training and 60,000 for testing; for CIFAR-10 (60,000 samples, 1,024 features, 10 classes), 10,000 for
training and the rest for testing; finally, for PASCAL07 (6,878 samples, 1,024 features after down-sampling
the images, 10 classes), 3,000 for training and the rest for testing.

Figure 1: The top s retrieval results and Precision-Recall curve on Pendigits dataset over *SHL and 6 other
hashing algorithms (view in color).

Figure 2: The top s retrieval results and Precision-Recall curve on USPS dataset over *SHL and 6 other
hashing algorithms (view in color).

For all the algorithms used, average performances over 5 runs are reported in terms of the following
two criteria: (i) retrieval precision of the s closest hash codes of training samples; we used $s \in \{10, 15, \ldots, 50\}$;
(ii) Precision-Recall (PR) curve, where retrieval precision and recall are computed for hash codes within a
Hamming radius of $r \in \mathbb{N}_B$.
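Both criteria can be sketched for a single query as follows. The toy database, the stable tie-breaking of the ranking, and the convention that an empty retrieval set has precision 0 are assumptions made for illustration.

```python
def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)


def top_s_precision(query_code, query_label, db_codes, db_labels, s):
    """Fraction of the s database codes closest to the query that share its label."""
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    hits = sum(1 for i in order[:s] if db_labels[i] == query_label)
    return hits / s


def radius_precision_recall(query_code, query_label, db_codes, db_labels, r):
    """Precision and recall of the database codes within Hamming radius r of the query."""
    retrieved = [i for i, c in enumerate(db_codes) if hamming(query_code, c) <= r]
    relevant = sum(1 for label in db_labels if label == query_label)
    hits = sum(1 for i in retrieved if db_labels[i] == query_label)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / relevant if relevant else 0.0
    return precision, recall


# Hypothetical 4-bit database with two classes.
db_codes = [[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, -1], [-1, -1, -1, -1], [-1, -1, -1, 1]]
db_labels = [0, 0, 1, 1, 1]
query_code, query_label = [1, 1, 1, 1], 0

p_at_2 = top_s_precision(query_code, query_label, db_codes, db_labels, s=2)
p_at_3 = top_s_precision(query_code, query_label, db_codes, db_labels, s=3)
prec_r1, rec_r1 = radius_precision_recall(query_code, query_label, db_codes, db_labels, r=1)
prec_r2, rec_r2 = radius_precision_recall(query_code, query_label, db_codes, db_labels, r=2)
```

Sweeping the radius from small to large values traces out one PR curve per query; averaging over queries and runs would give curves of the kind reported in the figures.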

The following *SHL settings were used: SVM’s parameter $C$ was set to 1000; for MKL, 11 kernels were
considered: 1 normalized linear kernel, 1 normalized polynomial kernel and 9 Gaussian kernels. For the
polynomial kernel, the bias was set to 1.0 and its degree was chosen as 2. For the bandwidth $\sigma$ of the
Gaussian kernels the following values were used: $[2^{-7}, 2^{-5}, 2^{-3}, 2^{-1}, 1, 2^{1}, 2^{3}, 2^{5}, 2^{7}]$. Regarding the MKL
constraint set, a value of $p = 2$ was chosen. For the remaining approaches, namely KSH, SPH, AGH, BRE,
parameter values were used according to recommendations found in their respective references. All obtained
results are reported in Figure 1 through Figure 5.

We clearly observe that *SHL performs best among all the algorithms considered. For all the datasets,
*SHL achieves the highest top-10 retrieval precision. Especially for the non-digit datasets (CIFAR-10,
PASCAL07), *SHL achieves significantly better results. As for the PR-curve, *SHL also yields the largest
areas under the curve. Although noteworthy results were reported in [13] for KSH, in our experiments
*SHL outperformed it across all datasets. Moreover, we observe that supervised hash learning algorithms,
except BRE, perform better than unsupervised variants. BRE may need a longer bit length to achieve better
performance as implied by Figure 1 and Figure 3. Additionally, it is worth pointing out that *SHL performed
remarkably well for short bit lengths across all datasets.

Figure 3: The top s retrieval results and Precision-Recall curve on Mnist dataset over *SHL and 6 other
hashing algorithms (view in color).

Figure 4: The top s retrieval results and Precision-Recall curve on CIFAR-10 dataset over *SHL and 6 other
hashing algorithms (view in color).

It must be noted that AGH also yielded good results, compared with other unsupervised hashing algorithms,
perhaps due to the anchor points it utilizes as side information to generate hash codes. With the
exception of *SHL and KSH, the remaining approaches exhibit poor performance for the non-digit datasets 
we considered. 

When varying the top-s number between 10 and 50, once again with the exception of *SHL and KSH, the 
performance of the remaining approaches deteriorated in terms of top-s retrieval precision. KSH performs 
slightly worse when s increases, while *SHL’s performance remains robust for CIFAR-10 and PASCAL07.
It is worth mentioning that the two-layer AGH exhibits better robustness than its single-layer version for 
datasets involving images of digits. Finally, Figure 6 shows some qualitative results for the CIFAR-10 
dataset. In conclusion, in our experimentation, *SHL exhibited superior performance for every code length 
we considered. 


Figure 5: The top s retrieval results and Precision-Recall curve on PASCAL07 dataset over *SHL and 6
other hashing algorithms (view in color).


5.2 Transductive Hash Learning Results 

As a proof of concept, in this section, we report a performance comparison of our framework, when used 
in an inductive versus a transductive [23] mode. Note that, to the best of our knowledge, no other hash 
learning approaches to date accommodate transductive hash learning in a natural manner like *SHL. For 
illustration purposes, we used the Vowel and Letter datasets. We randomly chose 330 training and 220 test 
samples for the Vowel and 300 training and 200 test samples for the Letter. Each scenario was run 20 times 
and the code length (B) varied from 4 to 15 bits. The results are shown in Figure 7 and reveal the potential 
merits of the transductive *SHL learning mode across a range of code lengths. 

6 Conclusions 

In this paper we considered a novel hash learning framework with two main advantages. First, its
Majorization-Minimization (MM)/Block Coordinate Descent (BCD) training algorithm is efficient and simple to
implement. Secondly, this framework is able to address supervised, unsupervised and, even, semi-supervised
learning tasks in a unified fashion. In order to show the merits of the method, we performed a series of
experiments involving 5 benchmark datasets. In these experiments, a comparison of *Supervised Hash
Learning (*SHL) with 6 other state-of-the-art hashing methods shows *SHL to be highly competitive.